Exploratory Data Analysis

## # A tibble: 5 × 15
##   AUCTION_ID         TIMESTAMP DATE_UTC PUBLISHER_ID DEVICE_TYPE DEVICE_GEO_CITY
##   <chr>              <chr>     <chr>    <chr>              <int> <chr>          
## 1 0000060c-b8a9-414… 2025-10-… 2025-10… LteIcOiSsaE5           4 Salem          
## 2 0000060c-b8a9-414… 2025-10-… 2025-10… LteIcOiSsaE5           4 Salem          
## 3 0000060c-b8a9-414… 2025-10-… 2025-10… LteIcOiSsaE5           4 Salem          
## 4 0000060c-b8a9-414… 2025-10-… 2025-10… LteIcOiSsaE5           4 Salem          
## 5 0000060c-b8a9-414… 2025-10-… 2025-10… LteIcOiSsaE5           4 Salem          
## # ℹ 9 more variables: DEVICE_GEO_ZIP <chr>, PRICE <chr>, REQUESTED_SIZES <chr>,
## #   SIZE <chr>, RESPONSE_TIME <chr>, BID_WON <chr>, DEVICE_GEO_LAT <dbl>,
## #   DEVICE_GEO_LONG <dbl>, DEVICE_GEO_REGION <chr>

Univariate Analysis

Numerical Variables

Using the information_gain() function, we can evaluate how much each variable reduces uncertainty in predicting the target variable PRICE. REQUESTED_SIZES and BID_WON share the highest score (8.77), followed by AUCTION_ID (7.28) and TIMESTAMP (5.88); every other variable scores below 1. Identifier-like fields such as AUCTION_ID and TIMESTAMP take many distinct values, which tends to inflate information gain, so their ranking should be read with caution. Notably, DEVICE_GEO_REGION provides zero information gain, which is expected given that the dataset is entirely restricted to Oregon.

Information Gain Scores
attributes importance
REQUESTED_SIZES 8.7732790
BID_WON 8.7732790
AUCTION_ID 7.2821643
TIMESTAMP 5.8811804
DEVICE_GEO_CITY 0.9168208
PUBLISHER_ID 0.7873141
DEVICE_TYPE 0.5318504
SIZE 0.5197298
DATE_UTC 0.2506063
DEVICE_GEO_ZIP 0.0000000
RESPONSE_TIME 0.0000000
DEVICE_GEO_LAT 0.0000000
DEVICE_GEO_LONG 0.0000000
DEVICE_GEO_REGION 0.0000000
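The report does not show how information_gain() was called, so the following is only a minimal base-R sketch of the underlying idea (entropy of the target minus its conditional entropy given an attribute). The helper names `entropy` and `info_gain` and the toy data frame are hypothetical and are not the auction data. It illustrates two behaviours visible in the table: a column that is unique per row reaches the maximum possible gain, and a constant column scores exactly zero.

```r
# Shannon entropy (in nats) of a discrete vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Information gain of attribute x about target y: H(y) - H(y | x).
info_gain <- function(x, y) {
  groups <- split(y, x)
  # Conditional entropy: entropy of y within each level of x, weighted by level size.
  h_y_given_x <- sum(sapply(groups, function(g) length(g) / length(y) * entropy(g)))
  entropy(y) - h_y_given_x
}

# Toy data (hypothetical): a binned target, an ID-like column, a constant column.
toy <- data.frame(
  price_bin = c("low", "low", "low", "high", "high", "high"),
  row_id    = c("a", "b", "c", "d", "e", "f"),  # unique per row, like an identifier
  region    = rep("OR", 6)                      # constant, like DEVICE_GEO_REGION
)

gain_id     <- info_gain(toy$row_id, toy$price_bin)  # equals entropy(price_bin)
gain_region <- info_gain(toy$region, toy$price_bin)  # zero: a constant tells us nothing
```

Because uniqueness alone is enough to maximize the score, a high value for an identifier does not by itself indicate predictive usefulness.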

Price

The histogram of bid prices peaks sharply at the left of the plot, indicating that most bids are low in value. The summary statistics reveal several extreme outliers that create a right-skewed distribution, with prices reaching up to $141.25. The boxplot highlights a median bid price of approximately $0.20, with an interquartile range from $0.07 to $0.57, so at least three quarters of bids are below $1.

Summary Statistics for PRICE
Min 1st Q Median Mean 3rd Q Max
7.1e-05 0.07 0.196 0.474 0.57 141.25
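To put the size of the outliers in perspective, the reported quartiles can be plugged into the conventional 1.5 × IQR rule. This is a small worked calculation, assuming the boxplot used that rule for its whiskers (the report does not say); only the three numbers taken from the summary table come from the source.

```r
# Quartiles and maximum of PRICE as reported in the summary table.
q1 <- 0.07
q3 <- 0.57
max_price <- 141.25

iqr <- q3 - q1                  # interquartile range: 0.50
upper_fence <- q3 + 1.5 * iqr   # values above this are flagged as outliers

# Distance of the maximum above Q3, measured in IQRs.
max_in_iqrs <- (max_price - q3) / iqr
```

Under this rule any bid above roughly $1.32 counts as an outlier, and the maximum sits more than 280 IQRs above the third quartile.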

Response Time

The histogram of the Response Time variable exhibits a right-skewed distribution, with a mean response time of about 202 milliseconds and a majority of observations falling between 50 and 400 milliseconds. The boxplot shows a median of 164 milliseconds and an interquartile range of 110 to 257 milliseconds. While the boxplot truncates numerous outliers, these higher values are evident in the histogram. Overall, the system typically responds quickly, with a smaller subset of observations exhibiting longer response times.

Summary Statistics for Response Time
Min 1st Q Median Mean 3rd Q Max
10 110 164 201.536 257 1893

Device Geo Latitude

The majority of latitude values fall between 45 and 46 degrees, with a minimum observed value of 42.055 degrees. The distribution is slightly left-skewed, peaking around 45.5 degrees. The boxplot shows a median latitude of 45.514, with a narrow interquartile range from 45.357 to 45.525. These findings are consistent with the fact that the bid data were collected exclusively within Oregon.

Summary Statistics for Latitude
Min 1st Q Median Mean 3rd Q Max
42.055 45.357 45.514 45.143 45.525 46.1879

Device Geo Longitude

The histogram of the longitude variable suggests an approximately normal distribution. A few outliers extend east as far as about -116.9 degrees, but the majority of observations are concentrated around -122.5 degrees. The boxplot indicates a median longitude of -122.64, with an interquartile range from -122.83 to -122.59. This narrow span of 0.24 degrees captures the central 50% of the longitude values.

Summary Statistics for Longitude
Min 1st Q Median Mean 3rd Q Max
-124.5055 -122.829 -122.64 -122.65 -122.589 -116.8578

Categorical Variables

Device Type

Within the Device Type variable, two categories (Personal Computers and Tablets) dominate, together comprising 93.5% of the dataset. These device types therefore exert a strong influence on the overall distribution. The Connected Device category is the least common, representing only 0.07% of observations.

DEVICE_TYPE Count Percent
Personal Computer 229203 51.98
Tablet 183076 41.52
Connected TV 24639 5.59
Mobile/Tablet 3697 0.84
Connected Device 322 0.07
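The percentages in the table can be reproduced directly from the counts. The snippet below is a short check using only the counts reported above; the variable names are illustrative.

```r
# Counts per device type, copied from the table.
counts <- c(
  "Personal Computer" = 229203,
  "Tablet"            = 183076,
  "Connected TV"      = 24639,
  "Mobile/Tablet"     = 3697,
  "Connected Device"  = 322
)

# Share of each category, in percent, rounded to two decimals.
percent <- round(100 * counts / sum(counts), 2)

# Combined share of the two dominant categories.
top_two_share <- sum(percent[c("Personal Computer", "Tablet")])
```

The counts sum to 440,937 rows, and the two largest categories together account for 93.5% of them.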

Device Geo City

Portland is the most frequently represented city in the dataset, accounting for nearly 60% of all observations. The next most common cities (Salem, Beaverton, and Hillsboro) each contribute roughly 3% of the data. Given Portland’s population density, it has a disproportionately large influence on the bid data compared to the other cities included.

DEVICE_GEO_CITY Count Percent
Portland 263884 59.85
Salem 14373 3.26
Beaverton 13998 3.17
Hillsboro 13603 3.09
Medford 11618 2.63
Eugene 9969 2.26
Gresham 9952 2.26
Other 103540 23.40

Device Geo Zip

The ZIP code variable is spread across many values, with the most frequent ZIP code representing only about 10% of the data. The five most frequent ZIP codes each account for roughly 24,000 to 45,000 observations, while the least frequent ZIP codes appear only once in the dataset.

DEVICE_GEO_ZIP Count Percent
97232 44792 10.16
97233 30832 6.99
97252 30014 6.81
97216 24449 5.54
97206 24392 5.53
97214 16353 3.71
97202 12248 2.78
Other 257857 58.38
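Tables like the one above keep the most frequent levels and pool everything else into an "Other" row. One way this could be done in base R is sketched below; `top_n_table` is a hypothetical helper name, and the ZIP codes in the example vector are made-up inputs rather than the real data.

```r
# Frequency table that keeps the n most common levels and pools the rest
# into a single "Other" row. Hypothetical helper, not the report's own code.
top_n_table <- function(x, n = 7) {
  freq <- sort(table(x), decreasing = TRUE)
  freq <- setNames(as.integer(freq), names(freq))  # plain named integer vector
  keep <- head(freq, n)
  count <- c(keep, Other = sum(freq) - sum(keep))
  data.frame(
    level   = names(count),
    count   = unname(count),
    percent = round(100 * unname(count) / sum(count), 2),
    stringsAsFactors = FALSE
  )
}

# Toy input: two frequent codes and two rare ones.
zips <- c(rep("97232", 5), rep("97233", 3), "97252", "97216")
result <- top_n_table(zips, n = 2)
```

With `n = 2` the two rare codes are folded into "Other", which mirrors how the 58% "Other" row in the table hides a long tail of infrequent ZIP codes.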

Size

The most common ad sizes in the auction are 320x50 and 300x250, comprising 60% and 30% of all bids, respectively, for a combined total of 399,140 observations. The next most frequent size, 300x50, appears in 22,547 observations, representing approximately 5% of the dataset. A large number of other ad sizes occur very infrequently, each appearing fewer than 10 times.

SIZE Count Percent
320x50 267589 60.69
300x250 131551 29.83
300x50 22547 5.11
300x600 7282 1.65
728x90 6968 1.58
970x250 2500 0.57
300x251 320 0.07
Other 2180 0.48

Bid Won

The barplot of BID_WON reveals that losing bids substantially outnumber winning bids. Although the primary focus of the analysis is on the relationship between bid outcomes and price, it is notable that losses occur about 2.7 times as often as wins.

BID_WON Count Percent
FALSE 320429 72.67
TRUE 120508 27.33
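The win rate and the loss-to-win ratio follow directly from the two counts in the table. The calculation below uses only those reported counts.

```r
# Outcome counts copied from the BID_WON table.
lost <- 320429
won  <- 120508

win_rate    <- won / (won + lost)  # fraction of bids that won
loss_to_win <- lost / won          # losing bids per winning bid
```

This gives a win rate of about 27.3% and roughly 2.66 losses for every win.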

Device Geo Region

As noted previously, all data points are located within the state of Oregon, restricting the analysis to this specific geographic region.

DEVICE_GEO_REGION Count Percent
OR 440937 100

Bivariate Analysis

Numerical vs Numerical

Price and Response Time

Several extreme values in the price variable deviate from the overall distribution. Longer response times appear to be associated with lower bid prices, which may mean that slower bidders enter lower initial bids and subsequently lose to faster competitors. Some points are also clearly separated from the main distribution; these likely represent bidders who either submitted higher-priced bids initially or were able to quickly adjust and rebid at a higher price within the same auction. In the visualization of average price and response time for the top 5 publishers, most publishers show higher prices in the 300 to 600 millisecond range. Beyond that range, bid prices decline as response time increases, although the largest publisher holds steady at its high price. Three of the top 5 also show an increase in bid price just after the initial response period, while the other two show higher average bids at shorter response times.

Price and Latitude

The scatterplot of latitude versus price suggests a slight increase in bid prices around the 45.5-degree latitude mark. The higher concentration of points in this area is likely related to greater population density. Additionally, we observe a pattern similar to the Price versus Response Time graph, with some bids significantly higher than the main distribution. These points may represent a “second round” of bidding, where initial lower bids were rejected and subsequently raised. Plotting the mean price against latitude bins reveals a clear spike in price at latitude 43.2, with large variance in the surrounding area. The mean price between 44.3 and 46.2 is relatively stable by comparison, which suggests there could be a geographic influence on bid prices.
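The binned-mean view described above can be sketched with cut() and tapply(). The data here are synthetic (random latitudes and log-normal prices), and the 0.5-degree bin width is an assumption, since the report does not state the bin size it used.

```r
set.seed(1)

# Synthetic stand-in for the auction data (hypothetical values).
bids <- data.frame(
  lat   = runif(500, min = 42, max = 46.2),
  price = rlnorm(500, meanlog = -1.5, sdlog = 1)
)

# Assign each bid to a 0.5-degree latitude bin.
breaks <- seq(42, 46.5, by = 0.5)
bids$lat_bin <- cut(bids$lat, breaks = breaks, include.lowest = TRUE)

# Mean price and number of bids per bin.
bin_mean <- tapply(bids$price, bids$lat_bin, mean)
bin_n    <- tapply(bids$price, bids$lat_bin, length)
```

Reporting `bin_n` next to `bin_mean` is useful because a bin with few bids and a heavy-tailed price distribution can produce a spike in the mean from a handful of extreme values.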

Price and Longitude

Similar to the latitude plot, a large cluster of points is observed around -122.5 degrees longitude, likely reflecting areas of higher population density. Additionally, the pattern of higher-priced bids above the main distribution persists, suggesting a secondary set of elevated bids originating from these locations. When aggregating PRICE by longitude groups, it is clear that the price fluctuates. This variance further supports the possibility that regional effects are contributing to price differences.

Latitude and Response Time

The scatterplot of Latitude versus Response Time exhibits a generally consistent distribution of points across different latitudes, with the exception of a spike on the right side. There is a notable concentration of points around 45.5 degrees, and several other locations show outliers with unusually long response times.

Longitude and Response Time

The scatterplot shows a spike of higher response time values around longitude -123, with noticeably fewer points toward the right side of the plot. This may indicate that a specific location experiences slower response times, although it could also reflect lower sample density in that area. Additionally, there are no substantial outliers beyond longitude -121, suggesting a faster average response in these locations compared to the majority of the data.

Numerical vs Categorical

Price and Bid Won

A t-test yields a p-value below 2.2e-16, confirming a statistically significant difference in mean prices between won and lost bids. The accompanying boxplot illustrates the distribution of bid prices by outcome, clearly highlighting the separation between the two groups. For readability, the plot has been zoomed in to emphasize differences in median values and overall distribution. Notably, the median price of winning bids is substantially higher than that of losing bids, with a wider interquartile range, and the IQRs of the two groups do not overlap.
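The comparison can be illustrated on synthetic data, assuming the test was run with R's t.test(), which performs a Welch two-sample test by default. The log-normal prices below are made up. The figure 2.2e-16 corresponds to `.Machine$double.eps`, the threshold below which R prints a p-value as "< 2.2e-16" instead of showing its value. The rank-based wilcox.test() is included only as an alternative that is less sensitive to the extreme outliers noted earlier; it is not taken from the report.

```r
set.seed(42)

# Synthetic, right-skewed prices for each outcome (hypothetical values).
won  <- rlnorm(2000, meanlog = -1.0, sdlog = 1)
lost <- rlnorm(2000, meanlog = -2.0, sdlog = 1)

# Welch two-sample t-test on the means (var.equal = FALSE is the default).
tt <- t.test(won, lost)

# Rank-based alternative that does not rely on the mean.
wt <- wilcox.test(won, lost)

# TRUE when the p-value is smaller than the threshold R uses for printing.
below_print_floor <- tt$p.value < .Machine$double.eps
```

With hundreds of thousands of rows, even small differences become statistically significant, so the effect size (the difference in medians or means) is usually the more informative quantity.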

Price and Device Type

The boxplot of Device Type versus Price shows that the Tablet, Personal Computer, and Mobile/Tablet categories have comparable median prices. However, the interquartile range for Tablet is notably narrower, indicating lower variability within this group. In contrast, the Connected Device category exhibits both the highest median price and the widest IQR, reflecting substantially greater variability. This pattern may point to underlying factors contributing to the broader distribution of prices within this device type.

Median Bid Price by City

The visualization below illustrates the geographic distribution of median bid prices across Oregon. Cooler colors correspond to lower CPM values, while warmer colors (orange and yellow) indicate higher CPM levels. The largest circles show areas with the highest concentration of bids, most notably in northern Oregon around Portland. Despite the high bid volume in this region, the median CPM remains relatively low, typically between 0.2 and 0.3. Surrounding areas and parts of the Oregon coast exhibit noticeably higher median CPM values.
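The two quantities encoded on the map (median price for colour, bid volume for circle size) could be computed per city along these lines. The data frame is a toy example with invented prices, and the column names `median_cpm` and `n_bids` are illustrative.

```r
# Toy bids (hypothetical values); one Portland bid is an extreme outlier.
bids <- data.frame(
  city  = c("Portland", "Portland", "Portland", "Salem", "Salem", "Astoria"),
  price = c(0.20, 0.25, 9.00, 0.40, 0.60, 1.10),
  stringsAsFactors = FALSE
)

# Median price per city (drives the colour scale).
city_summary <- aggregate(price ~ city, data = bids, FUN = median)
names(city_summary)[2] <- "median_cpm"

# Number of bids per city (drives the circle size).
city_summary$n_bids <- as.integer(table(bids$city)[as.character(city_summary$city)])
```

Using the median keeps the single 9.00 bid from distorting the Portland value, which is the reason a median is preferable to a mean for prices this skewed.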

Response Time by City

This map shows median response times across Oregon cities. As in the previous map, cooler colors indicate lower response times, while warmer colors represent higher values. Portland, the most prominent city, appears on the cooler end of the spectrum, indicating a median response time lower than that of several outlying cities. Some nearby cities exhibit even lower response times, whereas a few locations in southern Oregon show comparatively higher values.

Device Type and Response Time

Most device types exhibit similar median response times, with the exception of the Connected Device category, which shows a higher median. All boxplots display outliers, and the Connected TV category also has a larger interquartile range. This variation may reflect differences in hardware or the way certain devices interact with the auction system. Some devices could be optimized differently, leading to slower bids. Confirming this hypothesis would likely require additional information about network traffic and other external factors affecting bidding performance.

Device Type and Timestamp

This heatmap visualizes Device Type activity across the available time window. Categories with relatively low overall activity (Connected Device, Connected TV, and Mobile/Tablet) exhibit a consistent pattern throughout the period, although some gaps are visible in the Connected Device category due to the small number of observations. The Tablet category also shows consistent activity, maintaining medium counts for most of the period with a slight spike between 3 and 4 pm. Activity for Personal Computer is similar to Tablet until around 6 pm, after which it increases substantially and remains higher through the end of the window at approximately 10 pm.
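The counts behind a heatmap of this kind come from cross-tabulating device type against a time bucket. The sketch below uses hourly buckets and invented timestamps on a placeholder date; since TIMESTAMP is stored as character in the data, it is parsed to POSIXct first. The UTC time zone and the hourly resolution are assumptions.

```r
# Toy records with placeholder timestamps (not taken from the dataset).
bids <- data.frame(
  timestamp = c("2000-01-01 15:05:00", "2000-01-01 15:40:00",
                "2000-01-01 18:10:00", "2000-01-01 18:20:00",
                "2000-01-01 18:55:00"),
  device = c("Tablet", "Tablet", "Personal Computer",
             "Personal Computer", "Tablet"),
  stringsAsFactors = FALSE
)

# Parse the character timestamps, then extract the hour of day.
parsed <- as.POSIXct(bids$timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
bids$hour <- as.integer(format(parsed, "%H"))

# Device-by-hour count matrix: one heatmap cell per entry.
activity <- table(device = bids$device, hour = bids$hour)
```

Cells with zero counts appear as gaps in the plot, which is what happens for sparse categories such as Connected Device.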

Categorical vs Categorical

Device Type by City

This map shows the major cities across Oregon along with the dominant Device Type in each location. In most areas, Personal Computers are the predominant device, though in some cities, such as Medford and Portland, the dominant device is Tablet.
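The dominant device per city is the modal category within each city. A minimal way to derive it from a cross-tabulation is shown below, using a toy data frame with invented rows.

```r
# Toy records (hypothetical values).
bids <- data.frame(
  city   = c("Portland", "Portland", "Portland", "Salem", "Salem", "Salem"),
  device = c("Tablet", "Tablet", "Personal Computer",
             "Personal Computer", "Personal Computer", "Tablet"),
  stringsAsFactors = FALSE
)

# City-by-device counts.
counts <- table(bids$city, bids$device)

# For each city (row), pick the device (column) with the highest count.
dominant <- colnames(counts)[apply(counts, 1, which.max)]
names(dominant) <- rownames(counts)
```

Note that which.max() returns the first maximum, so a tie is resolved in favour of the alphabetically earlier device; a city where two devices are nearly tied will still be shown with a single dominant type.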

Key Findings

Conclusion

The analysis indicates that most bids are low in price, with a few extreme outliers producing a right-skewed distribution. Winning bids are consistently higher than losing bids, confirming that win rate is strongly associated with price. Device type and geographic location both influence bid price and response time, with Personal Computers and Tablets dominating the dataset. Although response time shows only minor variation across devices and regions, some outliers could affect bidding outcomes. High population areas, such as Portland, account for the majority of bids, yet median prices in these regions remain moderate.

Limitations

The analysis is limited by the dataset being restricted to Oregon, which may reduce the generalizability of findings to other regions. The high concentration of observations from Portland could skew overall results. Outliers in both price and response time introduce bias to mean-based measures, necessitating the use of medians for more robust insights. Additionally, two device type categories (Tablet and Mobile/Tablet) overlap, which may prevent a full understanding of the nuances in hardware or network performance.

Future Analysis

Future analyses could explore the causal factors influencing bid success, including device performance and response times. Expanding the dataset to cover a longer time period would allow for the investigation of temporal patterns in bidding behavior. Additionally, examining interaction effects between variables such as device type, geographic location, time of day, and bid price could provide deeper insights into the dynamics driving successful bids.